Abstract

Econophysics is based on the premise that some ideas and methods from physics 
can be applied to economic situations. We intend to show in this paper how a 
physics concept such as entropy can be applied to an economic problem. In so 
doing, we demonstrate how information in the form of observable data and moment 
constraints is introduced into the method of Maximum relative Entropy (MrE). 
A general example of updating with data and moments is shown. Two specific 
econometric examples are solved in detail which can then be used as templates for 
real world problems. A numerical example is compared to a large deviation solution 
which illustrates some of the advantages of the MrE method. 



1 Introduction 

Methods of inference are not new to econometrics. In fact, one could say that 
the subject is founded on inference methods. Econophysics, on the other hand, 
is a much newer idea. It is based on the premise that some ideas and methods 
from physics can be applied to economic situations. In this paper we aim to 
show how a physics concept such as entropy can be applied to an economic 
problem. 

In 1957, Jaynes [1] showed that maximizing statistical mechanical entropy for 
the purpose of revealing how gas molecules were distributed was simply the 
maximizing of Shannon's information entropy [2] with statistical mechanical 
information. The method was true for assigning probabilities regardless of the 
information specifics. This idea led to MaxEnt, his use of the Method of 
Maximum Entropy for assigning probabilities. This method has evolved to 
a more general method, the method of Maximum (relative) Entropy (MrE) 
[3,4,8], which has the advantage of not only assigning probabilities but 
updating them when new information is given in the form of constraints on the 
family of allowed posteriors. One of the drawbacks of the MaxEnt method 
was the inability to include data. When data were present, one used Bayesian 
methods. The methods were combined in such a way that MaxEnt was used 
for assigning a prior for Bayesian methods, as Bayesian methods could not 
deal with information in the form of constraints, such as expected values. The 
main purpose of this paper is to show both general and specific examples of 
how the MrE method can be applied using data and moment constraints. 

The numerical example in this paper addresses a recent paper by Grendar 
and Judge (GJ) [6] where they consider the problem of criterion choice in 
the context of large deviations (LD). Specifically, they attempt to justify the 
method by Owen [7] in a LD context with a new method of their own. They 
support this idea by citing a paper in the econometric literature by Kitamura 
and Stutzer [5] who also use LD to justify a particular empirical estimator. 
We attempt to simplify their (GJ) initial problem by providing an example 
that is a bit more practical. However, our example has the same issue: what 
does one do when one has information in the form of an "average" of a large 
data set and a small sample of that data set? We will show by example that 
the LD approach is a special case of our method, the method of Maximum 
(relative) Entropy. 

In section 2 we show a general example of updating simultaneously with two 
different forms of information: moments and data. The solution resembles 
Bayes' rule. In fact, if there are no moment constraints then the method 
produces Bayes' rule exactly [8]. If there is no data, then the MaxEnt solution is 
produced. The realization that MrE includes not just MaxEnt but also Bayes' 
rule as special cases is highly significant. It implies that MrE is capable of 
producing every aspect of orthodox Bayesian inference and proves the complete 
compatibility of Bayesian and entropy methods. Further, it opens the door 
to tackling problems that could not be addressed by either the MaxEnt or 
orthodox Bayesian methods individually; problems in which one has data and 
moment constraints. 

In section 3 we comment on the problem of non-commuting constraints. We 
discuss the question of whether they should be processed simultaneously, or 
sequentially, and in what order. Our general conclusion is that these different 
alternatives correspond to different states of information and accordingly we 
expect that they will lead to different inferences. 

1 The constraints that we will be dealing with are more general than moments; 
they are actually expected values. For simplicity we will refer to these expected 
values as moments. 

In sections 4 and 5 we provide two toy examples that illustrate potential economic 
problems similar to the ones discussed in GJ. The two examples (ill-behaved as 
mentioned in GJ) are solved in detail. The first example demonstrates how 
data and moments can be processed sequentially. This is how Bayesian 
statistics traditionally uses MaxEnt principles: MaxEnt is used 
to create a prior for the Bayesian formulation. The second example illustrates a 
problem that Bayes and MaxEnt alone cannot handle: simultaneous processing 
of data and moments. These two examples will seem trivially different but this 
is deceptive. They actually ask and answer two completely different questions. 
It is this 'triviality' that is often a source of confusion in Bayesian literature 
and therefore we wish to expose it. 

In section 6 we compare a numerical example that is solved by MrE and one 
that is solved by GJ's method. Since GJ's solution comes out of LD, they 
rely on asymptotic arguments; one assumes an infinite sample set which is 
not necessarily realistic. The MrE method does not need such assumptions to 
work and therefore can process finite amounts of data well. However, when 
MrE is taken to asymptotic limits one recovers the same solutions that the 
large deviation methods produce. 



2 Simultaneous updating with moments and data 



Our first concern when using the MrE method to update from a prior to 
a posterior distribution is to define the space in which the search for the 
posterior will be conducted. We wish to infer something about the values of 
one or several quantities, $\theta \in \Theta$, on the basis of three pieces of information: 
prior information about $\theta$ (the prior), the known relationship between $x$ and 
$\theta$ (the model), and the observed values of the data $x \in \mathcal{X}$. Since we are 
concerned with both $x$ and $\theta$, the relevant space is neither $\mathcal{X}$ nor $\Theta$ but the 
product $\mathcal{X} \times \Theta$ and our attention must be focused on the joint distribution 
$P(x,\theta)$. The selected joint posterior $P_{new}(x,\theta)$ is that which maximizes the 
entropy (see footnote 2), 

$$ S[P, P_{old}] = -\int dx\,d\theta\; P(x,\theta) \log \frac{P(x,\theta)}{P_{old}(x,\theta)} , \tag{1} $$

subject to the appropriate constraints. $P_{old}(x,\theta)$ contains our prior 
information which we call the joint prior. To be explicit, 

$$ P_{old}(x,\theta) = P_{old}(\theta)\, P_{old}(x|\theta) , \tag{2} $$

where $P_{old}(\theta)$ is the traditional Bayesian prior and $P_{old}(x|\theta)$ is the likelihood. 
It is important to note that they both contain prior information. The Bayesian 
prior is defined as containing prior information. However, the likelihood is not 
traditionally thought of in terms of prior information. Of course it is reasonable 
to see it as such because the likelihood represents the model (the relationship 
between $\theta$ and $x$) that has already been established. Thus we consider both 
pieces, the Bayesian prior and the likelihood, to be prior information. 

2 In the MrE terminology, we "maximize" the negative relative entropy, $S$, so that 
$S \le 0$. This is the same as minimizing the relative entropy. 

The new information is the observed data, $x'$, which in the MrE framework 
must be expressed in the form of a constraint on the allowed posteriors. The 
family of posteriors that reflects the fact that $x$ is now known to be $x'$ is such 
that 

$$ P(x) = \int d\theta\; P(x,\theta) = \delta(x - x') , \tag{3} $$

where $\delta(x - x')$ is the Dirac delta function. This amounts to an infinite 
number of constraints: there is one constraint on $P(x,\theta)$ for each value of the 
variable $x$ and each constraint will require its own Lagrange multiplier $\lambda(x)$. 
Furthermore, we impose the usual normalization constraint, 

$$ \int dx\,d\theta\; P(x,\theta) = 1 , \tag{4} $$

and include additional information about $\theta$ in the form of a constraint on the 
expected value of some function $f(\theta)$, 

$$ \int dx\,d\theta\; P(x,\theta) f(\theta) = \langle f(\theta) \rangle = F . \tag{5} $$

Note: an additional constraint in the form of $\int dx\,d\theta\, P(x,\theta) g(x) = \langle g \rangle = G$ 
could only be used when it does not contradict the data constraint (3). 
Therefore, it is redundant and the constraint would simply get absorbed when 
solving for $\lambda(x)$. We also emphasize that constraints imposed at the level of the 
prior need not be satisfied by the posterior. What we do here differs from the 
standard Bayesian practice in that we require the constraint to be satisfied by 
the posterior distribution. 

We proceed by maximizing (1) subject to the above constraints. The purpose of 
maximizing the entropy is to determine the value for $P$ when $\delta S = 0$. That is, 
we want the value of $P$ that is closest to $P_{old}$ given the constraints. The 
calculus of variations is used to do this by varying $P \to P + \delta P$, i.e. setting the 
derivative with respect to $P$ equal to zero. The Lagrange multipliers $\alpha$, $\beta$ and 
$\lambda(x)$ are used so that the $P$ that is chosen satisfies the constraint equations. 
The actual values are determined by the value of the constraints themselves. 
We now provide the detailed steps in this maximization process. 

First we set up the variational form with the Lagrange multipliers, 

$$ \delta \Big\{ S[P, P_{old}] + \alpha \Big[ \int dx\,d\theta\, P(x,\theta) - 1 \Big] 
   + \beta \Big[ \int dx\,d\theta\, P(x,\theta) f(\theta) - F \Big] 
   + \int dx\, \lambda(x) \Big[ \int d\theta\, P(x,\theta) - \delta(x - x') \Big] \Big\} = 0 . \tag{6} $$

We expand the entropy function (1), 

$$ \delta \Big\{ -\int dx\,d\theta\, P(x,\theta) \log P(x,\theta) 
   + \int dx\,d\theta\, P(x,\theta) \log P_{old}(x,\theta) 
   + \alpha \Big[ \int dx\,d\theta\, P(x,\theta) - 1 \Big] 
   + \beta \Big[ \int dx\,d\theta\, P(x,\theta) f(\theta) - F \Big] 
   + \int dx\, \lambda(x) \Big[ \int d\theta\, P(x,\theta) - \delta(x - x') \Big] \Big\} = 0 . \tag{7} $$

Next, vary the functions with respect to $P(x,\theta)$, 

$$ -\int dx\,d\theta\; \delta P(x,\theta) \log P(x,\theta) 
   - \int dx\,d\theta\; P(x,\theta) \frac{1}{P(x,\theta)}\, \delta P(x,\theta) 
   + \int dx\,d\theta\; \delta P(x,\theta) \log P_{old}(x,\theta) 
   + \alpha \int dx\,d\theta\; \delta P(x,\theta) 
   + \beta \int dx\,d\theta\; \delta P(x,\theta) f(\theta) 
   + \int dx\, \lambda(x) \int d\theta\; \delta P(x,\theta) = 0 , \tag{8} $$

which can be rewritten as 

$$ \int dx\,d\theta\, \big\{ -\log P(x,\theta) - 1 + \log P_{old}(x,\theta) + \alpha + \beta f(\theta) + \lambda(x) \big\}\, \delta P(x,\theta) = 0 . $$

The terms inside the brackets must sum to zero, therefore we can write, 

$$ \log P(x,\theta) = \log P_{old}(x,\theta) - 1 + \alpha + \beta f(\theta) + \lambda(x) \tag{9} $$

or 

$$ P_{new}(x,\theta) = P_{old}(x,\theta)\, e^{(-1 + \alpha + \beta f(\theta) + \lambda(x))} . \tag{10} $$

In order to determine the Lagrange multipliers, we substitute our solution 
(10) into the various constraint equations. The constant $\alpha$ is eliminated by 
substituting (10) into (4), 

$$ \int dx\,d\theta\; P_{old}(x,\theta)\, e^{(-1 + \alpha + \beta f(\theta) + \lambda(x))} = 1 . \tag{11} $$

Dividing both sides by the constant $e^{(-1+\alpha)}$, 

$$ \int dx\,d\theta\; P_{old}(x,\theta)\, e^{\beta f(\theta) + \lambda(x)} = e^{(1 - \alpha)} . \tag{12} $$

Then substituting back into (10) yields 

$$ P_{new}(x,\theta) = P_{old}(x,\theta)\, \frac{e^{\lambda(x) + \beta f(\theta)}}{Z} , \tag{13} $$

where 

$$ Z = e^{1 - \alpha} = \int dx\,d\theta\; e^{\beta f(\theta) + \lambda(x)}\, P_{old}(x,\theta) . \tag{14} $$



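In the same fashion, the Lagrange multipliers $\lambda(x)$ are determined by 
substituting (13) into (3), 

$$ \int d\theta\; P_{old}(x,\theta)\, \frac{e^{\lambda(x) + \beta f(\theta)}}{Z} = \delta(x - x') \tag{15} $$

or 

$$ e^{\lambda(x)} = \frac{Z}{\int d\theta\, e^{\beta f(\theta)} P_{old}(x,\theta)}\; \delta(x - x') . \tag{16} $$

The posterior now becomes 

$$ P_{new}(x,\theta) = P_{old}(x,\theta)\, \delta(x - x')\, \frac{e^{\beta f(\theta)}}{\zeta(x,\beta)} , \tag{17} $$

where $\zeta(x,\beta) = \int d\theta\, e^{\beta f(\theta)} P_{old}(x,\theta)$. 

The Lagrange multiplier $\beta$ is determined by first substituting (17) into (5), 

$$ \int dx\,d\theta\, \Big[ P_{old}(x,\theta)\, \delta(x - x')\, \frac{e^{\beta f(\theta)}}{\zeta(x,\beta)} \Big] f(\theta) = F . \tag{18} $$

Integrating over $x$ yields, 

$$ \frac{\int d\theta\; e^{\beta f(\theta)} P_{old}(x',\theta) f(\theta)}{\zeta(x',\beta)} = F , \tag{19} $$

where $\zeta(x,\beta) \to \zeta(x',\beta) = \int d\theta\, e^{\beta f(\theta)} P_{old}(x',\theta)$. Now $\beta$ can be determined 
by rewriting (19) as 

$$ \frac{\partial \ln \zeta(x',\beta)}{\partial \beta} = F . \tag{20} $$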



The final step is to marginalize the posterior, $P_{new}(x,\theta)$, over $x$ to get our 
updated probability, 

$$ P_{new}(\theta) = P_{old}(x',\theta)\, \frac{e^{\beta f(\theta)}}{\zeta(x',\beta)} . \tag{21} $$

Additionally, this result can be rewritten using the product rule 
($P(x,\theta) = P(x) P(\theta|x)$) as 

$$ P_{new}(\theta) = P_{old}(\theta)\, P_{old}(x'|\theta)\, \frac{e^{\beta f(\theta)}}{\zeta'(x',\beta)} , \tag{22} $$

where $\zeta'(x',\beta) = \int d\theta\, e^{\beta f(\theta)} P_{old}(\theta) P_{old}(x'|\theta)$. The right side resembles Bayes' 
theorem, where the term $P_{old}(x'|\theta)$ is the standard Bayesian likelihood and 
$P_{old}(\theta)$ is the prior. The exponential term is a modification to these two 
terms. In an effort to put some names to these pieces we will call the standard 
Bayesian likelihood the likelihood and the exponential part the likelihood 
modifier so that the product of the two gives the modified likelihood. The 
denominator is the normalization or marginal modified likelihood. Notice when 
$\beta = 0$ (no moment constraint) we recover Bayes' rule. For $\beta \neq 0$ Bayes' rule 
is modified by a "canonical" exponential factor. 
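
The structure of (20) and (22) can be checked numerically on a small toy problem. The sketch below is only an illustration under stated assumptions: the coin-bias model, the data (7 heads, 3 tails), the target value 0.5, the grid size and all function names are hypothetical and do not come from the paper, and the multiplier is found by bisection for simplicity.

```python
import math

# Hypothetical toy problem: theta is the bias of a coin on a midpoint grid.
N = 2000
theta = [(i + 0.5) / N for i in range(N)]
prior = [1.0 / N] * N                      # flat P_old(theta)
heads, tails = 7, 3                        # assumed observed data x'
like = [t**heads * (1 - t)**tails for t in theta]  # P_old(x'|theta) up to a constant

def f(t):
    return t                               # constrained function f(theta)

def posterior(beta):
    """Eq. (22): prior * likelihood * exp(beta f) / zeta'. Returns (P_new, zeta')."""
    w = [p * l * math.exp(beta * f(t)) for p, l, t in zip(prior, like, theta)]
    zeta = sum(w)
    return [v / zeta for v in w], zeta

def mean_f(beta):
    post, _ = posterior(beta)
    return sum(p * f(t) for p, t in zip(post, theta))

def solve_beta(F, lo=-200.0, hi=200.0):
    """Bisection on <f>(beta) = F; <f> is increasing in beta."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mean_f(mid) < F:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

bayes_mean = mean_f(0.0)        # beta = 0 reduces to plain Bayes' rule
beta = solve_beta(0.5)          # assumed moment constraint <theta> = 0.5
eps = 1e-5                      # central finite difference for d ln(zeta') / d beta
dlnz = (math.log(posterior(beta + eps)[1])
        - math.log(posterior(beta - eps)[1])) / (2 * eps)
print(bayes_mean, beta, mean_f(beta), dlnz)
```

With $\beta = 0$ the grid posterior is the ordinary Bayesian one, while the solved $\beta$ makes the posterior mean match the imposed value, and the finite-difference derivative of $\ln \zeta'$ reproduces that same value as (20) requires.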



3 Commutativity of constraints 



When we are confronted with several constraints, such as in the previous 
section, we must be particularly cautious. In what order should they be pro- 
cessed? Or should they be processed at the same time? The answer depends 
on the nature of the constraints and the question being asked [9]. 

We refer to constraints as commuting when it makes no difference whether 
they are processed simultaneously or sequentially. The most common example 
of commuting constraints is Bayesian updating on the basis of data collected 
in multiple experiments. For the purpose of inferring $\theta$ it is well known that 
the order in which the observed data $x' = \{x'_1, x'_2, \ldots\}$ is processed does not 
matter. The proof that MrE is completely compatible with Bayes' rule implies 
that data constraints implemented through $\delta$ functions, as in (3), commute 
just as they do in Bayes. 

It is important to note that when an experiment is repeated it is common 
to refer to the value of $x$ in the first experiment and the value of $x$ in the 
second experiment. This is a dangerous practice because it obscures the fact 
that we are actually talking about two separate variables. We do not deal 
with a single $x$ but with a composite $x = (x_1, x_2)$ and the relevant space is 
$\mathcal{X}_1 \times \mathcal{X}_2 \times \Theta$. After the first experiment yields the value $x'_1$, represented by the 
constraint $c_1 : P(x_1) = \delta(x_1 - x'_1)$, we can perform a second experiment that 
yields $x'_2$ and is represented by a second constraint $c_2 : P(x_2) = \delta(x_2 - x'_2)$. 
These constraints $c_1$ and $c_2$ commute because they refer to different variables 
$x_1$ and $x_2$. 

Fig. 1. Illustrating the difference between processing two constraints $C_1$ and 
$C_2$ sequentially ($P_{old} \to P_1 \to P_{new}^{(a)}$) and simultaneously ($P_{old} \to P_{new}^{(b)}$ or 
$P_{old} \to P_1 \to P_{new}^{(b)}$). 

As a side note, use of a $\delta$ function has been criticized in that by implementing 
it, the probability is completely constrained, thus it cannot be updated by 
future information. This is certainly true! An experiment, once performed 
and its outcome observed, cannot be un-performed and its result cannot be 
un-observed by subsequent experiments. Thus, imposing one constraint does 
not imply a revision of the other. 

In general constraints need not commute and when this is the case the order in 
which they are processed is critical. For example, suppose the prior is $P_{old}$ and 
we receive information in the form of a constraint, $C_1$. To update we maximize 
the entropy $S[P, P_{old}]$ subject to $C_1$ leading to the posterior $P_1$ as shown in Fig. 
1. Next we receive a second piece of information described by the constraint 
$C_2$. At this point we can proceed in essentially two different ways: 

a) Sequential updating - 

Having processed $C_1$, we use $P_1$ as the current prior and maximize $S[P, P_1]$ 
subject to the new constraint $C_2$. This leads us to the posterior $P_{new}^{(a)}$. 

b) Simultaneous updating - 

Use the original prior $P_{old}$ and maximize $S[P, P_{old}]$ subject to both constraints 
$C_1$ and $C_2$ simultaneously. This leads to the posterior $P_{new}^{(b)}$. At first sight it 
might appear that there exists a third possibility of simultaneous updating: (c) 
use $P_1$ as the current prior and maximize $S[P, P_1]$ subject to both constraints 
$C_1$ and $C_2$ simultaneously. Fortunately, and this is a valuable check for the 
consistency of the MrE method, it is easy to show that case (c) is equivalent 
to case (b). Whether we update from $P_{old}$ or from $P_1$ the selected posterior is 
$P_{new}^{(b)}$. 

To decide which path (a) or (b) is appropriate, we must be clear about how the 
MrE method treats constraints. The MrE machinery interprets a constraint 
such as $C_1$ in a very mechanical way: all distributions satisfying $C_1$ are in 
principle allowed and all distributions violating $C_1$ are ruled out. 

Updating to a posterior $P_1$ consists precisely in revising those aspects of the 
prior $P_{old}$ that disagree with the new constraint $C_1$. However, there is nothing 
final about the distribution $P_1$. It is just the best we can do in our current 
state of knowledge and we fully expect that future information may require 
us to revise it further. Indeed, when new information $C_2$ is received we must 
reconsider whether the original $C_1$ remains valid or not. Are all distributions 
satisfying the new $C_2$ really allowed, even those that violate $C_1$? If this is the 
case then the new $C_2$ takes over and we update from $P_1$ to $P_{new}^{(a)}$. The constraint 
$C_1$ may still retain some lingering effect on the posterior $P_{new}^{(a)}$ through $P_1$, but 
in general $C_1$ has now become obsolete. 

Alternatively, we may decide that the old constraint $C_1$ retains its validity. 
The new $C_2$ is not meant to revise $C_1$ but to provide an additional refinement 
of the family of allowed posteriors. In this case the constraint that correctly 
reflects the new information is not $C_2$ but the more restrictive space where 
$C_1$ and $C_2$ overlap. The two constraints should be processed simultaneously 
to arrive at the correct posterior $P_{new}^{(b)}$. 

To summarize: sequential updating is appropriate when old constraints be- 
come obsolete and are superseded by new information; simultaneous updating 
is appropriate when old constraints remain valid. The two cases refer to dif- 
ferent states of information and therefore we expect that they will result in 
different inferences. These comments are meant to underscore the importance 
of understanding what information is being processed; failure to do so will 
lead to errors that do not reflect a shortcoming of the MrE method but rather 
a misapplication of it. 
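
A minimal sketch of the ordering issue, under assumptions that are not from the paper: a three-outcome variable with a uniform prior and two hypothetical expected-value constraints ($\langle f \rangle = 2.3$ with $f = (1,2,3)$, and $\langle g \rangle = 0.1$ with $g = (0,1,0)$). Each single update uses the exponential form that maximizing the relative entropy produces, with the multiplier found by bisection; the helper names are illustrative only.

```python
import math

def tilt(prior, h, mult):
    """Exponential form prior_i * exp(mult * h_i), normalised."""
    logw = [math.log(p) + mult * hi for p, hi in zip(prior, h)]
    top = max(logw)                       # subtract the max for numerical stability
    w = [math.exp(v - top) for v in logw]
    z = sum(w)
    return [v / z for v in w]

def mean(p, h):
    return sum(pi * hi for pi, hi in zip(p, h))

def update(prior, h, target, lo=-50.0, hi=50.0):
    """Maximise S[P, prior] subject to <h> = target (bisection on the multiplier)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(tilt(prior, h, mid), h) < target:
            lo = mid
        else:
            hi = mid
    return tilt(prior, h, 0.5 * (lo + hi))

def update_both(prior, f, F, g, G, lo=-50.0, hi=50.0):
    """Impose <f> = F and <g> = G together (outer bisection on the g multiplier)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        p = update(tilt(prior, g, mid), f, F)
        if mean(p, g) < G:
            lo = mid
        else:
            hi = mid
    return update(tilt(prior, g, 0.5 * (lo + hi)), f, F)

prior = [1.0 / 3.0] * 3
f, F = [1.0, 2.0, 3.0], 2.3               # hypothetical constraint C1
g, G = [0.0, 1.0, 0.0], 0.1               # hypothetical constraint C2

seq_12 = update(update(prior, f, F), g, G)   # C1 first, then C2
seq_21 = update(update(prior, g, G), f, F)   # C2 first, then C1
both = update_both(prior, f, F, g, G)        # C1 and C2 simultaneously
for name, p in (("C1 then C2", seq_12), ("C2 then C1", seq_21), ("both", both)):
    print(name, [round(x, 4) for x in p], round(mean(p, f), 4), round(mean(p, g), 4))
```

In each sequential run only the most recently processed constraint is guaranteed to hold, so the two orders give different posteriors, whereas the simultaneous update satisfies both constraints at once.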



4 An econometric problem: sequential updating 

This is an example of a problem using the MrE method. The general 
background information is that a factory makes $k$ different kinds of bouncy balls. 
For reference, they assign each different kind a number, $f_1, f_2, \ldots f_k$. They 
ship large boxes of them out to stores. Unfortunately, there is no mechanism 
that regulates how many of each ball goes into the boxes, therefore we do not 
know the amount of each kind of ball in any of the boxes. 

For this problem we are informed that the company does know the average 
of all the kinds of balls, $F$, that is produced by the factory over the time that 
they have been in existence. This is information about the factory. By using 
this information with MrE we get what one would get with the old MaxEnt 
method, a distribution of balls for the whole factory. 

However, we would like to know the probability of getting a certain kind of 
ball in a particular box. Therefore, we are allowed to randomly select a few 
balls, $n$, from the particular box in question and count how many of each kind 
we get, $m_1, m_2 \ldots m_k$ (or perhaps we simply open the box and look at the balls 
on the surface). This is information about the particular box. Now let us put 
the above example in a more mathematical format. 

Let the set of possible outcomes be represented by $\Omega = \{f_1, f_2, \ldots f_k\}$ from a 
sample where the total number of balls, $N \to \infty$ (see footnote 3), and whose sample average 
is $F$. Further, let us draw a data sample of size $n$, from a particular subset 
of the original sample, $\omega$ where $\omega \in \Omega$ and whose outcomes are counted 
and represented as $m = (m_1, m_2 \ldots m_k)$ where $n = \sum_i^k m_i$. We would like to 
determine the probability of getting any particular type in one draw 
($\theta = \{\theta_1, \theta_2 \ldots \theta_k\}$) out of the subset given the information. To do this we start with 
the appropriate joint entropy, 

$$ S[P, P_{old}] = -\sum_m \int d\theta\; P(m,\theta|n) \log \frac{P(m,\theta|n)}{P_{old}(m,\theta|n)} . \tag{23} $$

We then maximize this entropy with respect to $P(m,\theta|n)$ to process the first 
piece of information that we have which is the moment constraint, $C_1$, that is 
related to the factory, 

$$ C_1 : \langle f(\theta) \rangle = F \quad \text{where} \quad f(\theta) = \sum_i^k f_i \theta_i , \tag{24} $$

subject to normalization, where $\theta = \{\theta_1, \theta_2 \ldots \theta_k\}$, $m = (m_1 \ldots m_k)$ and where (see footnote 4) 

$$ \sum_m = \sum_{m_1 \ldots m_k = 0}^{n} \delta\Big( \sum_{i=1}^k m_i - n \Big) , \tag{25} $$

and 

$$ \int d\theta = \int d\theta_1 \ldots d\theta_k\; \delta\Big( \sum_{i=1}^k \theta_i - 1 \Big) . \tag{26} $$

This yields, 

$$ P_1(m,\theta|n) = P_{old}(m,\theta|n)\, \frac{e^{\lambda f(\theta)}}{Z_1} , \tag{27} $$

where the normalization constant $Z_1$ and the Lagrange multiplier $\lambda$ are 
determined from 

$$ Z_1 = \int d\theta\; e^{\lambda f(\theta)} P_{old}(\theta|n) \quad \text{and} \quad \frac{\partial \log Z_1}{\partial \lambda} = F . \tag{28} $$

3 It is not necessary for $N \to \infty$ for the MrE method to work. We simply wish to 
use the description of the problem that is common in information-theoretic 
examples. It must be strongly noted however that in general a sample average is not an 
expectation value. 

4 The $\delta$ functions in (25) and (26) are used to clarify the summation 
notation used in (23). They are not information constraints on $P$ as in (3) and later 
(32). 

We need to determine what to use for our joint prior, 

$$ P_{old}(m,\theta|n) = P_{old}(m|\theta,n)\, P_{old}(\theta|n) \tag{29} $$

in our problem. The mathematical representation of the situation where we 
wish to know the probability of selecting $m_i$ balls of the $i$th type from a sample 
of $n$ balls of $k$ types is simply the multinomial distribution. Therefore, the 
equation that we will use for our model, the likelihood, $P_{old}(m|\theta,n)$ is, 

$$ P_{old}(m_1 \ldots m_k | \theta_1 \ldots \theta_k, n) = \frac{n!}{m_1! \ldots m_k!}\, \theta_1^{m_1} \ldots \theta_k^{m_k} . \tag{30} $$

Since at this point we are completely ignorant of $\theta$, we use a prior that is 
flat, thus $P_{old}(\theta|n)$ = constant. Being a constant, the prior can come out of 
the integral and cancels with the same constant in the numerator. (Also, the 
particular form of $P_{old}(\theta|n)$ is not important for our current purpose so for the 
sake of definiteness we can choose it flat for our example. There are most likely 
better choices for priors, such as a Jeffreys prior.) Thus, after marginalizing 
over $m$, the joint distribution (27) can be rewritten as 

$$ P_1(\theta) = \frac{e^{\lambda f(\theta)}}{Z_1} . \tag{31} $$

Now we wish to process the next piece of information which is the data 
constraint, 

$$ C_2 : P(m) = \delta_{mm'} . \tag{32} $$

Here we use a Kronecker delta function since $m$ is discrete in this example. Our 
goal is to infer the $\theta$ that apply to our particular box. The original constraint 
$C_1$ applies to the whole factory while the new constraint $C_2$ refers to the actual 
box of interest and thus takes precedence over $C_1$. As $n \to \infty$ we expect $C_1$ 
to become less and less relevant. Therefore the two constraints should be 
processed sequentially. 

We maximize again with our new information which yields, 

$$ P_{new}^{(a)}(m,\theta) = \delta_{mm'}\, P_1(\theta|m) . \tag{33} $$

Marginalizing over $m$ and using (31) the final posterior for $\theta$ is 

$$ P_{new}^{(a)}(\theta) = P_1(\theta|m') = P_{old}(m'|\theta)\, \frac{e^{\lambda f(\theta)}}{Z_2} , \tag{34} $$

where 

$$ Z_2 = \int d\theta\; e^{\lambda f(\theta)} P_{old}(m'|\theta) . \tag{35} $$

Those familiar with using MaxEnt and Bayes will undoubtedly recognize that 
(34) is precisely the result obtained by using MaxEnt to obtain a prior, in 
this case $P_1(\theta)$ given in (31), and then using Bayes' rule to take the data 
into account. This familiar result has been derived in detail for two reasons: 
first, to reassure the readers that MrE does reproduce the standard solutions 
to standard problems and second, to establish a contrast with the example 
discussed next. NOTE: Since the constraints $C_1$ and $C_2$ do not commute one 
will get a different result if they are processed in a different order. 



5 An econometric problem: simultaneous updating 

This is another example of a problem using the MrE method. The general 
background information is the same as the previous example. For this problem 
we are informed that the company knows the average of all the kinds of balls, 
$F$, in each box. By using this information with MrE we get what one would 
get with the old MaxEnt method, a distribution of balls for each box. 

However, we still would like to know the probability of getting a certain kind 
of ball in a particular box and we are allowed to randomly select a few balls, 
$n$, from the particular box in question once again. Since both of these pieces of 
information apply to the same box, they must be processed simultaneously. 
In other words, both constraints must hold, always. We proceed as in the 
first example by maximizing (23) subject to normalization and the following 
constraints simultaneously, 

$$ C_3 : \langle f(\theta) \rangle = F \quad \text{where} \quad f(\theta) = \sum_i^k f_i \theta_i \tag{36} $$

(notice $C_3 \neq C_1$ because they are two different pieces of information) and 

$$ C_2 : P(m) = \delta_{mm'} . \tag{37} $$

This yields, 

$$ P_{new}^{(b)}(\theta) = P_{old}(m'|\theta)\, \frac{e^{\beta f(\theta)}}{\zeta} , \tag{38} $$

where 

$$ \zeta = \int d\theta\; e^{\beta f(\theta)} P_{old}(m'|\theta) \tag{39} $$

and 

$$ F = \frac{\partial \log \zeta}{\partial \beta} . \tag{40} $$

This looks like the sequential case but there is a crucial difference: $\beta \neq \lambda$ 
and $\zeta \neq Z_2$. In the sequential updating case, the multiplier $\lambda$ is chosen so 
that the intermediate $P_1$ satisfies $C_1$ while the posterior $P_{new}^{(a)}$ only satisfies 
$C_2$. In the simultaneous updating case the multiplier $\beta$ is chosen so that the 
posterior $P_{new}^{(b)}$ satisfies both $C_1$ and $C_2$ or $C_1 \wedge C_2$. Ultimately, the two 
distributions $P_{new}(\theta)$ are different because they refer to different problems. For 
more examples using this method see [9]. 
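
The contrast between the sequential multiplier $\lambda$ and the simultaneous multiplier $\beta$ can be made concrete with a deliberately small sketch. Everything specific here is an assumption for illustration and not taken from the paper: only two kinds of balls ($f_1 = 1$, $f_2 = 2$), an expected value $F = 1.7$, a sample of 6 and 4 balls, a flat prior on a one-dimensional grid, and bisection as the root finder.

```python
import math

# Hypothetical two-type toy version of the bouncy-ball problem.
N = 2000
theta = [(i + 0.5) / N for i in range(N)]        # theta_1; theta_2 = 1 - theta_1
f = [t * 1.0 + (1.0 - t) * 2.0 for t in theta]   # f(theta) = sum_i f_i theta_i, f = (1, 2)
m1, m2 = 6, 4                                    # assumed sample counts m'
like = [t**m1 * (1 - t)**m2 for t in theta]      # multinomial likelihood up to a constant
flat = [1.0] * N                                 # flat prior P_old(theta|n)
F = 1.7                                          # assumed expected value

def mean_f(base, mult):
    """<f> under the distribution proportional to base * exp(mult * f)."""
    w = [b * math.exp(mult * fi) for b, fi in zip(base, f)]
    return sum(wi * fi for wi, fi in zip(w, f)) / sum(w)

def solve(base, target, lo=-200.0, hi=200.0):
    """Bisection for the multiplier that gives <f> = target."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if mean_f(base, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam = solve(flat, F)            # sequential: constraint imposed on the prior, as in (28)
beta = solve(like, F)           # simultaneous: constraint imposed on the posterior, as in (40)
seq_mean = mean_f(like, lam)    # <f> under the sequential posterior, form (34)
sim_mean = mean_f(like, beta)   # <f> under the simultaneous posterior, form (38)
print(lam, beta, seq_mean, sim_mean)
```

In this toy setting the two multipliers differ, the simultaneous posterior reproduces $F$ exactly, and the sequential posterior does not, which is the sense in which the two updates answer different questions.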



6 Numerical examples 



The purpose of this section is twofold: First, we would like to provide a 
numerical example of a MrE solution. Second, we wish to examine a current, 
relevant econometric solution proposed by GJ in [6] using the method of types, 
specifically large deviation theory, for an "ill-posed" problem that is similar to 
the one discussed in section 5. This solution will be compared with a solution 
using MrE. 

To summarize the problem once again: The factory makes $k$ different kinds of 
bouncy balls and for reference, they assign each different type a number, 
$f_1, f_2, \ldots f_k$. We are informed that the company knows the expected type of 
ball, $F$, in each box over the time that they have been in existence. We would 
like a better idea of how many balls are in each box so we randomly select a 
few balls, $n$, from a particular box and count how many of each type we get, 
$m_1, m_2 \ldots m_k$. 

Or stated in a more mathematical format: Let the set of possible outcomes 
be represented by $\Omega = \{f_1, f_2, \ldots f_k\}$ from a sample where the total number 
of balls, $N \to \infty$, and where the average of the types of balls is $F$. Further, 
let us draw a data sample of size $n$, from the original sample, whose outcomes 
are counted and represented as $m = (m_1, m_2 \ldots m_k)$ where $n = \sum_i^k m_i$. The 
problem becomes ill-posed when the sample average of the counts 

$$ S_{avg} = \frac{1}{n} \sum_i^k m_i f_i \tag{41} $$

significantly deviates from the expected average of the types, $F$. 

We would like to determine the probability of getting any particular outcome 
in one draw ($\theta = \{\theta_1, \theta_2 \ldots \theta_k\}$) given the information. 



6.1 Sanov's theorem solution 



In [6] a form of Sanov's theorem is used. Here we give a brief description of 
Sanov's theorem. It is not intended to be a proof or exhaustive. It is simply 
shown to give a general indication of the basis for the solution in [6]. The key 
equation is (46). For a more detailed proof and explanation see [10]. 

Sanov's theorem - 

Let $X_1 \ldots X_n$ be independent and identically distributed (i.i.d.) with values 
in an arbitrary set $\mathcal{X}$ with common distribution $Q(x)$. Let $E \subseteq \mathcal{P}$ be a set of 
probability distributions. Then, 

$$ Q^n(E) = Q^n(E \cap \mathcal{P}_n) \le (n+1)^{|\mathcal{X}|}\, 2^{-n D(P^* \| Q)} , \tag{42} $$

where 

$$ P^* = \arg\min_{P \in E} D(P \| Q) \tag{43} $$

is the distribution in $E$ that is closest to $Q$ in the relative entropy or 
information divergence, 

$$ D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \tag{44} $$

and $n$ is the number of samples. If in addition, the set $E$ is the closure of its 
interior, 

$$ \frac{1}{n} \log Q^n(E) \to -D(P^* \| Q) . \tag{45} $$

The two equations become equal in the asymptotic limit. Essentially what this 
theorem says is that in the asymptotic limit, the frequency of the sample, $Q$, 
can be used to produce an estimate, $P^*$, of the "true" probability, $P$, by way 
of minimizing the relative entropy (44). 



For our problem, the solution for the probability using Sanov is of the form, 

$$ P^*_i = \frac{Q_i\, e^{\eta f_i}}{\sum_j^k Q_j\, e^{\eta f_j}} , \tag{46} $$

where $Q$ for our problem is the frequency of the counts, $m/n$, and $\eta$ is a 
Lagrange multiplier that is determined by the sample average (41), not an 
expected value as in our method. This solution seems very similar to our 
general solution using the MrE method (38) in which we also minimize an 
entropy (maximize our negative relative entropy). We could even think of $Q$ 
as a kind of joint prior and likelihood. However, there are many differences in 
the two methods, but the most glaring is that the GJ solution is only valid in 
the asymptotic case. We are not handicapped by this when MrE is used. 

Fig. 2. This figure shows the relationship between $\beta$ and $\langle f(\theta) \rangle = F(\beta)$. Notice that 
as the value for $F$ approaches the extremities of the outcomes, $\beta$ approaches infinity. 
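
As a rough numerical illustration of this kind of solution, the sketch below minimizes the relative entropy to the observed frequencies under a mean constraint, assuming the usual exponentially tilted form $P^*_i \propto Q_i e^{\eta f_i}$. The function names are hypothetical, bisection is used as a simple root finder, and the input values are the ones quoted in section 6.2 (counts 11, 2, 7 out of 20, labels 1, 2, 3 and $F = 2.3$).

```python
import math

def tilted(Q, f, eta):
    """Exponentially tilted distribution Q_i * exp(eta * f_i), normalised."""
    w = [q * math.exp(eta * fi) for q, fi in zip(Q, f)]
    z = sum(w)
    return [v / z for v in w]

def solve_eta(Q, f, F, lo=-50.0, hi=50.0):
    """Bisection for the multiplier eta such that sum_i P_i f_i = F."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        p = tilted(Q, f, mid)
        if sum(pi * fi for pi, fi in zip(p, f)) < F:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

Q = [11 / 20, 2 / 20, 7 / 20]    # observed frequencies m/n
f = [1.0, 2.0, 3.0]              # ball labels
F = 2.3                          # value imposed on the mean

eta = solve_eta(Q, f, F)
P_star = tilted(Q, f, eta)
print(eta, [round(p, 4) for p in P_star])   # close to the values reported in (49)
```

The output is a single point estimate for each probability; unlike the MrE posterior it carries no information about the spread around that estimate.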

6.2 Comparing the methods 

We illustrate the differences between the methods by examining a specific 
version of the above problem: Let there be three kinds of balls labeled 1, 
2 and 3. So for this problem we have $f_1 = 1$, $f_2 = 2$ and $f_3 = 3$. Further, 
we are given information regarding the expected value of each box, $F$. For 
our example this value will be $F = 2.3$. Notice that this implies that on the 
average there are more 3's in each box. Next we take a sample of one of the 
boxes where $m'_1 = 11$, $m'_2 = 2$ and $m'_3 = 7$. 
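
The claim about the 3's, and the size of the mismatch between the sample and $F$, follow from short arithmetic. The lines below are a worked check using only the numbers just given and the normalization $\sum_i \theta_i = 1$:

```latex
\begin{align*}
\langle f(\theta) \rangle
  &= \langle\theta_1\rangle + 2\langle\theta_2\rangle + 3\langle\theta_3\rangle
   = 2 + \big(\langle\theta_3\rangle - \langle\theta_1\rangle\big), \\
F = 2.3 \;&\Rightarrow\; \langle\theta_3\rangle - \langle\theta_1\rangle = 0.3, \\
S_{avg} &= \frac{1}{20}\,\big(11 \cdot 1 + 2 \cdot 2 + 7 \cdot 3\big) = \frac{36}{20} = 1.8 .
\end{align*}
```

So the expected value favors type 3 over type 1 by 0.3, while the sample average of 1.8 lies well below $F = 2.3$, which is the sense in which the problem is ill-posed.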

Using the MrE method in the same way that we have in each of the previous 
sections, we arrive at a posterior solution after maximizing the proper entropy 
subject to the constraints, 

$$ P_{MrE}(\theta_1, \theta_2) = \frac{1}{\zeta_{MrE}}\, e^{\beta(3 - 2\theta_1 - \theta_2)}\, \theta_1^{11}\, \theta_2^{2}\, (1 - \theta_1 - \theta_2)^{7} , \tag{47} $$

where the Lagrange multiplier $\beta$ was determined using Newton's method on 
the equation (40) and found to be $\beta = 14.1166$. We show the relationship 
between $\beta$ and $F$ in Fig. 2. 

This result is then put into our calculation of $\zeta_{MrE}$ so that $\zeta_{MrE} = 1874.1247$. 
Two plots are provided that show the marginal distributions of $\theta_1$ and $\theta_2$ (see 
Fig. 3). One may choose to have a single number represent $\theta_1$, $\theta_2$ and $\theta_3$. A 
popular choice is the mean, which is calculated for each marginal (see appendix 
for details), 

$$ \langle\theta_1\rangle = 0.2942, \quad \langle\theta_2\rangle = 0.1115, \quad \langle\theta_3\rangle = 0.5942 . \tag{48} $$
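
A calculation of this kind can be sketched numerically as follows. This is not the procedure used in the paper: it assumes a posterior proportional to $e^{\beta f(\theta)} \theta_1^{11} \theta_2^{2} \theta_3^{7}$ with a flat prior, replaces the integrals by sums over a midpoint grid on the simplex (the grid size is an arbitrary choice), and uses bisection instead of Newton's method. Results should agree with the quoted $\beta$ and means only up to discretization error.

```python
import math

f = (1.0, 2.0, 3.0)        # ball labels f_1, f_2, f_3
m = (11, 2, 7)             # observed counts m'
F = 2.3                    # expected value constraint

# Midpoint grid on the simplex theta_1 + theta_2 + theta_3 = 1 (assumed resolution).
n = 200
pts = []
for i in range(n):
    t1 = (i + 0.5) / n
    for j in range(n):
        t2 = (j + 0.5) / n
        t3 = 1.0 - t1 - t2
        if t3 > 0.0:
            loglike = m[0] * math.log(t1) + m[1] * math.log(t2) + m[2] * math.log(t3)
            fval = f[0] * t1 + f[1] * t2 + f[2] * t3
            pts.append((t1, t2, t3, fval, loglike))

def expectations(beta):
    """Posterior expectations of f and of each theta_i for a given beta."""
    logw = [beta * p[3] + p[4] for p in pts]
    top = max(logw)                          # subtract the max to avoid overflow
    w = [math.exp(v - top) for v in logw]
    z = sum(w)
    mean_f = sum(wi * p[3] for wi, p in zip(w, pts)) / z
    means = [sum(wi * p[k] for wi, p in zip(w, pts)) / z for k in range(3)]
    return mean_f, means

# Bisection on <f>(beta) = F; <f> increases with beta.
lo, hi = 0.0, 50.0
for _ in range(50):
    mid = 0.5 * (lo + hi)
    if expectations(mid)[0] < F:
        lo = mid
    else:
        hi = mid
beta = 0.5 * (lo + hi)
mean_f, means = expectations(beta)
print(beta, [round(x, 4) for x in means])
```

Because the whole posterior is available on the grid, the same weights can also be used to compute marginals or variances rather than only the means.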

We now use the GJ solution (46) to compute the "probabilities". We use the 
frequencies, $m/n$ for $Q$ or $Q_1 = 11/20$, $Q_2 = 2/20$ and $Q_3 = 7/20$ and assume 
that $F$ represents the sample average for the entire population of balls. This 
produces the following results: 

$$ P^*_1 = 0.3015, \quad P^*_2 = 0.0971, \quad P^*_3 = 0.6015 . \tag{49} $$

Fig. 3. These figures show the distributions of $\theta_1$ and $\theta_2$ respectively. 

Clearly the results are very close; however, there are several drawbacks to 
using the Sanov approach. The first is that $P^*$ is estimated on the basis of a 
frequency, $Q$, that is being used to represent an estimate of the entire 
population. As is well known this can only be the case when $n \to \infty$. MrE need 
not make such assumptions. Similarly MrE can incorporate actual expectation 
values, not sample averages disguised as them. Second, the correct distribution 
to be used is the multinomial when one is counting, not the frequencies of the 
observables. Third, and practically most important, because the MrE solution 
produces a probability distribution, one can take into account fluctuations. 
A single number would not give any indication as to the uncertainty of the 
estimate. With our method, one has the choice of which estimator one would 
like to use. Perhaps the distribution is almost flat. Then our method would 
indicate that almost any choice is equally likely. There is an underlying theme 
here: probabilities are not equivalent to frequencies except in the asymptotic 
case. Therefore, if one wishes to know the probable outcome of a problem in 
all cases, use MrE. 



7 Conclusions 

The realization that the MrE method incorporates MaxEnt and Bayes' rule 
as special cases has allowed us to go beyond Bayes' rule and MaxEnt methods 
to process both data and expected value constraints simultaneously. 
Therefore, we would like to emphasize that anything one can do with Bayesian 
or MaxEnt methods, one can now do with MrE. Additionally, in MrE one 
now has the ability to apply additional information that Bayesian or MaxEnt 
methods could not. Further, any work done with Bayesian techniques can be 
implemented into the MrE method directly through the joint prior. 

It is not uncommon to claim that the non-commutability of constraints
represents a problem for the MrE method, since processing constraints in
different orders might lead to different inferences. We have argued that, on
the contrary, the information conveyed by a particular sequence of constraints
is not the same as the information conveyed by the same constraints in a
different order. Since different informational states should in general lead
to different inferences, the way MrE processes non-commuting constraints
should not be regarded as a shortcoming but rather as a feature of the method.

Two specific econometric examples were solved in detail to illustrate the
application of the method. These cases can be used as templates for real world
problems. Numerical results were obtained to illustrate explicitly how the
method compares to other methods that are currently employed. The MrE method
was shown to be superior in that it does not need to make asymptotic
assumptions to function and it allows for fluctuations.
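
The large deviation figures entering that comparison can be reproduced from
the sample frequencies alone. The sketch below is a minimal illustration,
assuming the GJ solution has the usual exponentially tilted form
$P_i^* \propto Q_i e^{\lambda f_i}$ with ball values $f_i = 1, 2, 3$ and with
$\lambda$ fixed by $\sum_i f_i P_i^* = F$; the function names and the bisection
bracket are our own choices.

```python
import math

Q = (11 / 20, 2 / 20, 7 / 20)   # sample frequencies m/n
VALUES = (1.0, 2.0, 3.0)        # f_i: assumed ball values 1, 2, 3
TARGET = 2.3                    # F, treated here as a sample average

def tilted(lam):
    """Tilted frequencies P*_i proportional to Q_i * exp(lam * f_i)."""
    weights = [q * math.exp(lam * f) for q, f in zip(Q, VALUES)]
    total = sum(weights)
    return [w / total for w in weights]

def mean_value(p):
    """Expected ball value under the distribution p."""
    return sum(f * x for f, x in zip(VALUES, p))

def solve_lambda(lo=-20.0, hi=20.0, tol=1e-12):
    """Bisection; the tilted mean increases monotonically with lam."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_value(tilted(mid)) < TARGET:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam = solve_lambda()
p_star = tilted(lam)
print(f"lambda = {lam:.4f}")
print("P* =", [round(x, 4) for x in p_star])
```

Under these assumptions the tilted frequencies round to the values given in
(49). Note that the output is a single point estimate, with no distribution
attached from which fluctuations could be read off.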

It must be emphasized that in the asymptotic limit, the MrE form is analogous 
to Sanov's theorem. However, this is only one special case. The MrE method is 
more robust in that it can also be used to solve traditional Bayesian problems. 
In fact, it was shown that if there is no moment constraint one recovers
Bayes' rule.

Acknowledgements: I would like to acknowledge valuable discussions with 
A. Caticha, M. Grendar, C. Rodriguez and E. Scalas. 



A Solving the normalization factor 

Here we show how the means $\langle\theta_1\rangle$, $\langle\theta_2\rangle$
and $\langle\theta_3\rangle$ were calculated explicitly in the numerical
solutions section. The program Maple was used to calculate all results after
the integral form was created.

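In general, we rewrite the posterior in more detail, dropping the
superscripts,

$$P_{\text{new}}(\theta) = \frac{1}{\zeta'}\, \delta\Big(\sum_{i} \theta_i - 1\Big) \prod_{i=1}^{k} e^{\beta f_i \theta_i}\, \theta_i^{m_i}, \tag{A.1}$$

where $\zeta'$ differs from $\zeta$ in (39) only by a combinatorial coefficient,

$$\zeta' = \int \delta\Big(\sum_{i} \theta_i - 1\Big) \prod_{i=1}^{k} d\theta_i\, e^{\beta f_i \theta_i}\, \theta_i^{m_i}. \tag{A.2}$$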



A brute force calculation gives $\zeta'$ as a nested hypergeometric series,

$$\zeta' = e^{\beta f_k}\, I_1(I_2(\ldots(I_{k-1}))), \tag{A.3}$$

where each $I$ is written as a sum of $\Gamma$ functions,

$$I_j = \Gamma(b_j - a_j) \sum_{q_j=0}^{\infty} \frac{\Gamma(a_j + q_j)}{\Gamma(b_j + q_j)}\, \frac{t_j^{q_j}}{q_j!}\, I_{j+1}, \tag{A.4}$$

where $I_k = 1$. The index $j$ takes all values from 1 to $k - 1$ and the other
symbols are defined as follows: $t_j = \beta(f_{k-j} - f_k)$,
$a_j = m_{k-j} + 1$ and

$$b_j = n + j + 1 + \sum_{i=0}^{j-1} q_i - \sum_{i=0}^{k-j-1} m_i, \tag{A.5}$$

with $q_0 = m_0 = 0$. The terms that have indices $\le 0$ are equal to zero
(i.e. $b_0 = q_0 = 0$, etc.). A few technical details are worth mentioning.
First, one can have singular points when $t_j = 0$. In these cases the sum must
be evaluated in the limit as $t_j \to 0$. Second, since $a_j$ and $b_j$ are
positive integers the gamma functions involve no singularities. Lastly, the
sums converge because $b_j > a_j$. The normalization for the first example
(35) can be calculated in a similar way.

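Specifically for (47), the Lagrange multiplier $\beta$ was determined using
Newton's method on equation (40) and found to be $\beta = 14.1166$. This result
is then put into (A.3) in order to obtain $\zeta' = 1874.1247$. Next, the means
were calculated by increasing $m_i \to m_i + 1$ and $n \to n + 1$, then
recalculating $\zeta'$ so that

$$\langle\theta_1\rangle = \frac{\zeta'_{m_1+1,\,n+1}}{\zeta'}, \qquad \langle\theta_2\rangle = \frac{\zeta'_{m_2+1,\,n+1}}{\zeta'}, \qquad \langle\theta_3\rangle = 1 - \langle\theta_1\rangle - \langle\theta_2\rangle. \tag{A.6}$$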



Currently, for small values of $k$ (less than 10, depending on memory) it is
feasible to evaluate the nested sums numerically; for larger values of $k$ it
is best to evaluate the integral for $\zeta'$ using sampling methods.
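
As an illustration of the sampling route, $\zeta'$ can be estimated by drawing
$\theta$ from the Dirichlet distribution with parameters $m_i + 1$ (the
data-only posterior under a flat prior) and averaging the factor
$e^{\beta \sum_i f_i \theta_i}$, since
$\zeta' = B(m + 1)\, \mathbb{E}\big[e^{\beta \sum_i f_i \theta_i}\big]$ with $B$
the multivariate beta function. This is a rough sketch assuming
$f_i = 1, 2, 3$ and the value $\beta = 14.1166$ from the text; the proposal,
sample size, seed and names are our own choices, and the importance weights
are far from uniform, so the estimate is noisy.

```python
import math
import random

COUNTS = (11, 2, 7)        # counts m_1, m_2, m_3 from the numerical example
VALUES = (1.0, 2.0, 3.0)   # f_i: assumed ball values 1, 2, 3
BETA = 14.1166             # Lagrange multiplier quoted in the text

def estimate_zeta(n_samples=200_000, seed=0):
    """Importance-sampling estimate of zeta' and of the posterior means."""
    rng = random.Random(seed)
    alpha = [m + 1 for m in COUNTS]
    # log of the multivariate beta function B(alpha)
    log_b = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    total = 0.0
    weighted = [0.0, 0.0, 0.0]
    for _ in range(n_samples):
        # A Dirichlet draw is a set of normalised gamma variates.
        g = [rng.gammavariate(a, 1.0) for a in alpha]
        s = sum(g)
        theta = [x / s for x in g]
        w = math.exp(BETA * sum(f * t for f, t in zip(VALUES, theta)))
        total += w
        for k in range(3):
            weighted[k] += w * theta[k]
    zeta = math.exp(log_b) * total / n_samples
    means = [x / total for x in weighted]
    return zeta, means

zeta_mc, means_mc = estimate_zeta()
print(f"zeta' ~ {zeta_mc:.1f}")
print("means ~", [round(m, 3) for m in means_mc])
```

For larger $k$ only the lists of counts and values change, whereas the cost of
the nested sums grows with every additional index. A proposal closer to the
constrained posterior would reduce the variance of the weights.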